Learning with noisy labels is one of the hottest problems in weakly-supervised learning. Based on memorization effects of deep neural networks, training on small-loss instances becomes very promising for handling noisy labels. This fosters the state-of-the-art approach "Co-teaching" that cross-trains two deep neural networks using the small-loss trick. However, with the increase of epochs, two networks converge to a consensus and Co-teaching reduces to the self-training MentorNet. To tackle this issue, we propose a robust learning paradigm called Co-teaching+, which bridges the "Update by Disagreement" strategy with the original Co-teaching. First, two networks feed forward and predict all data, but keep prediction disagreement data only. Then, among such disagreement data, each network selects its small-loss data, but back propagates the small-loss data from its peer network and updates its own parameters. Empirical results on benchmark datasets demonstrate that Co-teaching+ is much superior to many state-of-theart methods in the robustness of trained models.
translated by 谷歌翻译
Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during training. Nonetheless, recent studies on the memorization effects of deep neural networks show that they would first memorize training data of clean labels and then those of noisy labels. Therefore in this paper, we propose a new deep learning paradigm called "Co-teaching" for combating with noisy labels. Namely, we train two deep neural networks simultaneously, and let them teach each other given every mini-batch: firstly, each network feeds forward all data and selects some data of possibly clean labels; secondly, two networks communicate with each other what data in this mini-batch should be used for training; finally, each network back propagates the data selected by its peer network and updates itself. Empirical results on noisy versions of MNIST, CIFAR-10 and CIFAR-100 demonstrate that Co-teaching is much superior to the state-of-the-art methods in the robustness of trained deep models. * The first two authors (Bo Han and Quanming Yao) made equal contributions. The implementation is available at https://github.com/bhanML/Co-teaching.32nd Conference on Neural Information Processing Systems (NIPS 2018),
translated by 谷歌翻译
Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle on domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated in order that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural symbolic methods NSCL and NSVQA, and two non-symbolic methods FiLM and mDETR; and our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, form a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.
translated by 谷歌翻译
In this work, we present a dense tracking and mapping system named Vox-Fusion, which seamlessly fuses neural implicit representations with traditional volumetric fusion methods. Our approach is inspired by the recently developed implicit mapping and positioning system and further extends the idea so that it can be freely applied to practical scenarios. Specifically, we leverage a voxel-based neural implicit surface representation to encode and optimize the scene inside each voxel. Furthermore, we adopt an octree-based structure to divide the scene and support dynamic expansion, enabling our system to track and map arbitrary scenes without knowing the environment like in previous works. Moreover, we proposed a high-performance multi-process framework to speed up the method, thus supporting some applications that require real-time performance. The evaluation results show that our methods can achieve better accuracy and completeness than previous methods. We also show that our Vox-Fusion can be used in augmented reality and virtual reality applications. Our source code is publicly available at https://github.com/zju3dv/Vox-Fusion.
translated by 谷歌翻译
虚拟内容创建和互动在现代3D应用中起着重要作用,例如AR和VR。从真实场景中恢复详细的3D模型可以显着扩大其应用程序的范围,并在计算机视觉和计算机图形社区中进行了数十年的研究。我们提出了基于体素的隐式表面表示Vox-Surf。我们的Vox-Surf将空间分为有限的体素。每个体素将几何形状和外观信息存储在其角顶点。 Vox-Surf得益于从体素表示继承的稀疏性,几乎适用于任何情况,并且可以轻松地从多个视图图像中训练。我们利用渐进式训练程序逐渐提取重要体素,以进一步优化,以便仅保留有效的体素,从而大大减少了采样点的数量并增加了渲染速度。细素还可以视为碰撞检测的边界量。该实验表明,与其他方法相比,Vox-Surf表示可以学习精致的表面细节和准确的颜色,并以更少的记忆力和更快的渲染速度来学习。我们还表明,Vox-Surf在场景编辑和AR应用中可能更实用。
translated by 谷歌翻译
我们研究了人类视觉系统(HVS)〜-〜形状,纹理和颜色〜-〜对对象分类的三个重要特征的贡献。我们构建了人形视觉引擎(HVE),该引擎明确和单独计算图像中的形状,纹理和颜色特征。然后将所得的特征向量连接以支持最终分类。我们表明,HVE可以总结和排序排序对对象识别的三个功能的贡献。我们使用人类实验来确认HVE和人类主要使用一些特定特征来支持特定类别的分类(例如,纹理是将斑马与其他四足动物区分开的主要特征,包括人类和HVE)。借助HVE的帮助,给定任何环境(数据集),我们可以总结整个任务的最重要功能(特定于任务的; (特定于类;为了证明HVE的更有用,我们使用它来模拟没有属性标签的人类的开放世界零射击学习能力。最后,我们表明HVE还可以通过不同特征的组合来模拟人类的想象力。我们将开源HVE引擎和相应的数据集。
translated by 谷歌翻译
我们描述了一种新的方法,该方法是基于与高级隐式语义特征的低级颜色和几何特征的汇总颜色和几何特征的室内识别。它使用了一个2阶段的深度学习框架,其中第一阶段经过了语义分割的辅助任务的训练,第二阶段的第二阶段使用了第一阶段的层中的特征来生成区分描述符以进行位置识别。辅助任务鼓励这些功能在语义上有意义,因此将RGB点云数据中的几何形状和颜色汇总为具有隐式语义信息。我们使用从扫描仪数据集派生的室内识别数据集进行培训和评估,其中一个包括由100个不同房间生成的3,608点云的测试集。与传统的基于功能的方法和四种最先进的深度学习方法进行比较表明,我们的方法显着优于所有五种方法,例如,取得前3名平均召回率为75%,而41%的平均召回率为41%最接近的竞争对手方法。我们的代码可在以下网址找到:https://github.com/yuhangming/semantic-indoor-place-recognition
translated by 谷歌翻译
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译